This procedure applies the DESeq2 method to normalize RNA-seq read count data. It estimates a size factor for each RNA-seq sample from a read count matrix and uses the size factors to normalize and transform the matrix. Between-sample variance is then compared before vs. after normalization. All samples are treated as one group during normalization.

 


1 Description

1.1 Project

Demo data of DESeq2 normalization

1.2 Data

Randomly generated

1.3 Analysis

This is a demo.

2 Analysis and results

2.1 Summary

The read count matrix includes 4 samples and 1000 variables/genes. The average read count per sample is 56.3 and the average read count per variable/gene is 24.

Table 1. Summary statistics of read counts per sample, before (B) and after (A) normalization by size factor. (Total: read count summation of all variables/genes; Mean/Median: read count mean/median of all variables/genes; and Variance: standard deviation of read counts after log2(count+1) transformation).

Sample   Total_B  Total_A   Mean_B  Mean_A  Median_B  Median_A  Variance_B  Variance_A  Size_Factor
sample1  39563    52108.53  39.56   52.11   15        19.76     2.11        2.21        0.759
sample2  60312    54469.38  60.31   54.47   22        19.87     2.24        2.21        1.107
sample3  48843    52005.92  48.84   52.01   21        22.36     2.24        2.26        0.939
sample4  76336    54832.88  76.34   54.83   30        21.55     2.34        2.23        1.392
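The per-sample statistics in Table 1 can be sketched in base R as follows. This is only an illustration under stated assumptions: the small count matrix and the size factors are made-up demo values, and the helper `log_sd` is a hypothetical name, not part of the report's code. The "Variance" columns follow the caption above, i.e. the standard deviation of log2(count+1).

```r
# Hypothetical demo inputs: a small count matrix and assumed size factors
counts <- cbind(sample1 = c(0, 3, 7, 15, 31),
                sample2 = c(1, 7, 15, 31, 63))
size_factors <- c(sample1 = 0.5, sample2 = 2)

# Normalize: divide each sample (column) by its size factor
normalized <- sweep(counts, 2, size_factors, "/")

# Standard deviation of log2(count + 1) within each sample
log_sd <- function(m) apply(log2(m + 1), 2, sd)

# One row per sample; B = before, A = after normalization
summary_table <- data.frame(
  Total_B     = colSums(counts),
  Total_A     = colSums(normalized),
  Mean_B      = colMeans(counts),
  Mean_A      = colMeans(normalized),
  Median_B    = apply(counts, 2, median),
  Median_A    = apply(normalized, 2, median),
  Variance_B  = log_sd(counts),
  Variance_A  = log_sd(normalized),
  Size_Factor = size_factors
)
print(round(summary_table, 2))
```

Dividing by the size factor rescales totals, means and medians by the same amount, so a sample with a size factor below 1 gains counts after normalization and a sample with a size factor above 1 loses counts, as in Table 1.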

2.2 Size factor

The size factors of all samples range between 0.759 and 1.392 (geometric mean = 1.024). In general, the size factor of a sample is expected to correlate positively with its total read count, and the geometric mean of the size factors is expected to be close to 1.0.
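For readers unfamiliar with how size factors are obtained, the median-of-ratios idea used by DESeq2 can be sketched in base R. This is a minimal illustration, not the code that produced this report: the function name `estimate_size_factors` and the toy counts are assumptions made for the example. In the DESeq2 package itself, the corresponding calculation is provided by `estimateSizeFactorsForMatrix()`.

```r
# Median-of-ratios size factors, written in base R for illustration.
# 'counts' is a genes x samples matrix of raw read counts.
estimate_size_factors <- function(counts) {
  log_counts <- log(counts)
  # Log geometric mean of each gene across samples (-Inf if any count is 0)
  log_geo_means <- rowMeans(log_counts)
  # Genes with a zero count in any sample are left out of the median
  usable <- is.finite(log_geo_means)
  # Size factor = median ratio of a sample's counts to the gene-wise geometric means
  apply(log_counts, 2, function(x) exp(median((x - log_geo_means)[usable])))
}

# Toy data (hypothetical): sample2 is sample1 sequenced exactly twice as deep
counts <- cbind(sample1 = c(10, 20, 40, 80, 0),
                sample2 = c(20, 40, 80, 160, 0))

sf <- estimate_size_factors(counts)
geo_mean_sf <- exp(mean(log(sf)))        # geometric mean of the size factors
normalized <- sweep(counts, 2, sf, "/")  # divide each column by its size factor
print(round(sf, 3))
```

In this toy case the deeper sample receives the larger size factor, the geometric mean of the two size factors is 1, and the two samples become identical after normalization.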

Figure 1. Relationship between read counts before normalization and size factors. Each point represents a sample.


Variables/genes contribute unequally to the calculation of size factors, with those having higher read counts carrying more weight. In most RNA-seq data, rRNAs (ribosomal RNAs) and some ‘housekeeping’ genes have the highest read counts, but they are usually not the focus of research interest and are often subject to systematic bias that does not affect most other genes, such as the efficiency of rRNA depletion. The actual impact of the top variables/genes on the size factors can be evaluated by removing them from the calculation one by one.

Table 2. Size factors after each step of removing top variables/genes.

Sample   Original  Step1  Step2  Step3  Step4  Step5  Step6  Step7  Step8  Step9  Step10
sample1  0.759     0.759  0.760  0.761  0.761  0.762  0.761  0.766  0.770  0.769  0.771
sample2  1.107     1.107  1.110  1.107  1.107  1.093  1.101  1.092  1.092  1.092  1.090
sample3  0.939     0.939  0.940  0.939  0.940  0.944  0.948  0.948  0.948  0.949  0.948
sample4  1.392     1.392  1.389  1.389  1.390  1.390  1.389  1.389  1.388  1.387  1.388

Figure 2. Change in the size factor of individual samples as the top 5% of the variables/genes were removed from the calculation.
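The stepwise removal of top variables/genes can be sketched as follows. This is a hedged illustration only: the data are simulated, genes are ranked by mean read count (an assumption about how "top" is defined), and the helper `size_factors` is a hypothetical base-R version of the median-of-ratios calculation rather than the report's own code.

```r
# Median-of-ratios size factors in base R (illustrative helper)
size_factors <- function(counts) {
  lg <- rowMeans(log(counts))   # log geometric mean per gene
  ok <- is.finite(lg)           # skip genes with a zero count in any sample
  apply(log(counts), 2, function(x) exp(median((x - lg)[ok])))
}

set.seed(1)
# Hypothetical simulated data: 200 genes x 4 samples with different depths
depth <- c(sample1 = 0.8, sample2 = 1.1, sample3 = 0.9, sample4 = 1.4)
base_mean <- rexp(200, rate = 1 / 50) + 1
counts <- sapply(depth, function(d) rnbinom(200, mu = base_mean * d, size = 10))

# Rank genes from highest to lowest mean read count (assumed definition of "top")
ranked <- order(rowMeans(counts), decreasing = TRUE)
n_steps <- 10

# Column k + 1 holds the size factors after removing the top k genes
steps <- sapply(0:n_steps, function(k) {
  keep <- if (k == 0) seq_len(nrow(counts)) else ranked[-seq_len(k)]
  size_factors(counts[keep, , drop = FALSE])
})
colnames(steps) <- c("Original", paste0("Step", seq_len(n_steps)))
print(round(steps, 3))
```

Size factors that barely move from one column to the next, as in Table 2, suggest that the normalization is not driven by a handful of highly expressed genes.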

2.3 Before vs. after normalization

Normalization generally reduces the variance between samples. The reduction can be assessed by comparing the distributions of read counts and by calculating between-sample variance before and after normalization.

Figure 3. Comparison of two samples with the lowest (sample1) and the highest (sample4) total read counts, before vs. after normalization.

Figure 4. Distribution of read counts before and after normalization. Read counts were log2-transformed.

Figure 5. Relationship between mean read count (log2-transformed) and between-sample variance.
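The comparison behind Figure 5 can be sketched in base R by computing, for each variable/gene, the mean and the between-sample variance of log2(count+1) before and after normalization. The data below are simulated and the helper `size_factors` is a hypothetical name, so this is an illustration under those assumptions rather than the report's own code.

```r
# Median-of-ratios size factors in base R (illustrative helper)
size_factors <- function(counts) {
  lg <- rowMeans(log(counts))   # log geometric mean per gene
  ok <- is.finite(lg)           # skip genes with a zero count in any sample
  apply(log(counts), 2, function(x) exp(median((x - lg)[ok])))
}

set.seed(2)
# Hypothetical simulated data: 500 genes x 4 samples with different depths
depth <- c(sample1 = 0.8, sample2 = 1.1, sample3 = 0.9, sample4 = 1.4)
base_mean <- rexp(500, rate = 1 / 50) + 1
counts <- sapply(depth, function(d) rpois(500, lambda = base_mean * d))

# Normalize by size factor, then log2-transform both versions
normalized <- sweep(counts, 2, size_factors(counts), "/")
log_before <- log2(counts + 1)
log_after  <- log2(normalized + 1)

# Per-gene mean and between-sample variance, before and after normalization
per_gene <- data.frame(
  mean_before = rowMeans(log_before),
  var_before  = apply(log_before, 1, var),
  mean_after  = rowMeans(log_after),
  var_after   = apply(log_after, 1, var)
)

# Typical between-sample variance before vs. after normalization
print(c(before = median(per_gene$var_before),
        after  = median(per_gene$var_after)))
```

Plotting `var_before` and `var_after` against the corresponding means would give a picture comparable to Figure 5. In this simulation the samples differ only in sequencing depth, so most of the between-sample variance is removed by normalization.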

3 Download

4 References

  • R: R Development Core Team, 2011. R: A Language and Environment for Statistical Computing. ISBN 3-900051-07-0.
  • Bioconductor: Gentleman RC et al., 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology.
  • Biclustering: Ihmels J, Bergmann S, Barkai N, 2004. Defining transcription modules using large-scale gene expression data. Bioinformatics.
  • GSEA: Subramanian A et al., 2005. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS.
  • RoCA
  • Awsomics

5 Reproduce this report

To reproduce this report:

  1. Find the data analysis template you want to use and an example of its paired YAML file here, and download the YAML example to your working directory.

  2. To generate a new report using your own input data and parameters, edit the following items in the YAML file:

    • output : where you want to put the output files
    • home : the URL of your project's home page, if it has one
    • analyst : your name
    • description : background information about your project, analysis, etc.
    • input : the location of your input data; read the instructions for preparing them
    • parameter : parameters for this analysis; read the instructions on how to prepare input data
  3. Run the code below within the R Console or RStudio, preferably in a new R session:

# Install (if missing) and load the packages needed to build the report
if (!require(devtools)) { install.packages('devtools'); library(devtools) }
if (!require(RCurl)) { install.packages('RCurl'); library(RCurl) }
if (!require(RoCA)) { install_github('zhezhangsh/RoCAR'); library(RoCA) }

# Replace the placeholder with the path to the YAML file you just downloaded and edited
filename.yaml <- 'filename.yaml'
CreateReport(filename.yaml)

If no errors are reported, go to the output folder and open the index.html file to view the report.

6 Session information

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] DEGandMore_0.0.0.9000       snow_0.4-3                 
##  [3] rchive_0.0.0.9000           awsomics_0.0.0.9000        
##  [5] colorspace_1.3-2            gplots_3.0.1               
##  [7] MASS_7.3-51.1               htmlwidgets_1.3            
##  [9] DT_0.5                      yaml_2.2.0                 
## [11] kableExtra_0.9.0            knitr_1.20                 
## [13] rmarkdown_1.10              RoCA_0.0.0.9000            
## [15] RCurl_1.95-4.11             bitops_1.0-6               
## [17] usethis_1.4.0               devtools_2.0.1             
## [19] DESeq2_1.22.1               SummarizedExperiment_1.12.0
## [21] DelayedArray_0.8.0          BiocParallel_1.16.2        
## [23] matrixStats_0.54.0          Biobase_2.42.0             
## [25] GenomicRanges_1.34.0        GenomeInfoDb_1.18.1        
## [27] IRanges_2.16.0              S4Vectors_0.20.1           
## [29] BiocGenerics_0.28.0        
## 
## loaded via a namespace (and not attached):
##  [1] fs_1.2.6               bit64_0.9-7            httr_1.3.1            
##  [4] RColorBrewer_1.1-2     rprojroot_1.3-2        tools_3.5.1           
##  [7] backports_1.1.2        R6_2.3.0               KernSmooth_2.23-15    
## [10] rpart_4.1-13           Hmisc_4.1-1            DBI_1.0.0             
## [13] lazyeval_0.2.1         nnet_7.3-12            withr_2.1.2           
## [16] gridExtra_2.3          prettyunits_1.0.2      processx_3.2.0        
## [19] bit_1.1-14             compiler_3.5.1         rvest_0.3.2           
## [22] cli_1.0.1              htmlTable_1.12         xml2_1.2.0            
## [25] desc_1.2.0             caTools_1.17.1.1       scales_1.0.0          
## [28] checkmate_1.8.5        readr_1.2.1            genefilter_1.64.0     
## [31] callr_3.0.0            stringr_1.3.1          digest_0.6.18         
## [34] foreign_0.8-71         XVector_0.22.0         pkgconfig_2.0.2       
## [37] base64enc_0.1-3        htmltools_0.3.6        sessioninfo_1.1.1     
## [40] highr_0.7              rlang_0.3.0.1          rstudioapi_0.8        
## [43] RSQLite_2.1.1          gtools_3.8.1           acepack_1.4.1         
## [46] magrittr_1.5           GenomeInfoDbData_1.2.0 Formula_1.2-3         
## [49] Matrix_1.2-15          Rcpp_1.0.0             munsell_0.5.0         
## [52] stringi_1.2.4          zlibbioc_1.28.0        pkgbuild_1.0.2        
## [55] plyr_1.8.4             grid_3.5.1             blob_1.1.1            
## [58] gdata_2.18.0           crayon_1.3.4           lattice_0.20-38       
## [61] splines_3.5.1          annotate_1.60.0        hms_0.4.2             
## [64] locfit_1.5-9.1         ps_1.2.1               pillar_1.3.0          
## [67] geneplotter_1.60.0     pkgload_1.0.2          XML_3.98-1.16         
## [70] glue_1.3.0             evaluate_0.12          latticeExtra_0.6-28   
## [73] data.table_1.11.8      remotes_2.0.2          gtable_0.2.0          
## [76] assertthat_0.2.0       ggplot2_3.1.0          xtable_1.8-3          
## [79] viridisLite_0.3.0      survival_2.43-3        tibble_1.4.2          
## [82] AnnotationDbi_1.44.0   memoise_1.1.0          cluster_2.0.7-1